Statistical Phrases in Automated Text Categorization

Authors

  • Maria Fernanda Caropreso
  • Stan Matwin
  • Fabrizio Sebastiani
Abstract

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call an n-gram a set t_k of n word stems, and we say that t_k occurs in a document d_j when a sequence of words appears in d_j that, after stop word removal and stemming, consists exactly of the n stems in t_k, in some order. Previous studies have investigated the use of n-grams (or some variant of them) in the context of specific learning algorithms, and thus have not obtained general answers on their usefulness for TC. Here we investigate the usefulness of n-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all α-grams (α ≤ n), and checking how many n-grams score high enough to be selected among the top σ α-grams. We report the results of our experiments, performed on the Reuters-21578 standard TC benchmark using several feature selection functions and varying values of σ. We also report results of making actual use of the selected n-grams in the context of a linear classifier induced by means of the Rocchio method.

Categories and subject descriptors: H.3.3 [Information storage and retrieval]: Information search and retrieval - Information filtering; H.3.3 [Information storage and retrieval]: Systems and software - Performance evaluation (efficiency and effectiveness); I.2.3 [Artificial Intelligence]: Learning - Induction

Terms: Algorithms, Experimentation, Theory
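The n-gram definition and the selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop list, the toy stemmer, the helper names (alpha_grams, chi_square, select_top, count_by_length) and the choice of chi-square as the scoring function are all hypothetical, since the abstract does not name the feature selection functions used. An unordered n-gram is represented here as a sorted tuple of stems, so that word order is ignored.

```python
from collections import Counter

# Hypothetical toy stop list; a real system would use a full stop list.
STOP_WORDS = {"the", "of", "a", "in", "and", "to", "is"}


def toy_stem(word):
    """Crude illustrative stemmer (strips a trailing 's'); not a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    return [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def alpha_grams(text, n):
    """Return the set of alpha-grams (1 <= alpha <= n) occurring in text.

    Each alpha-gram is the sorted tuple of alpha consecutive stems, taken
    after stop word removal and stemming, so order within the window is ignored.
    """
    stems = preprocess(text)
    grams = set()
    for alpha in range(1, n + 1):
        for i in range(len(stems) - alpha + 1):
            grams.add(tuple(sorted(stems[i:i + alpha])))
    return grams


def chi_square(term, docs, labels):
    """Chi-square score of a term for one binary category (assumed scoring function)."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y)
    b = sum(1 for d, y in zip(docs, labels) if term in d and not y)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y)
    d_neg = n - a - b - c
    denom = (a + c) * (b + d_neg) * (a + b) * (c + d_neg)
    return n * (a * d_neg - c * b) ** 2 / denom if denom else 0.0


def select_top(texts, labels, n, sigma):
    """Score the pool of all alpha-grams and keep the top sigma of them."""
    docs = [alpha_grams(t, n) for t in texts]
    pool = set().union(*docs)
    # Sort by descending score; ties are broken by the gram itself for determinism.
    ranked = sorted(pool, key=lambda g: (-chi_square(g, docs, labels), g))
    return ranked[:sigma]


def count_by_length(selected):
    """Count how many selected features are 1-grams, 2-grams, and so on."""
    return Counter(len(g) for g in selected)
```

Calling count_by_length(select_top(texts, labels, n, sigma)) for several values of sigma gives the kind of tally the abstract refers to: how many of the selected features are genuine multi-word n-grams rather than single stems.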

Similar resources

Feature Selection and Feature Extraction for Text Categorization

The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features...

A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call an n-gram a set gk of n word stems, and we say that gk occurs in a document dj when a sequence of words appears in dj that, after stop word removal and stemming, consists exactly of the n stems in gk, in some order. Previous studies have investigated the use of n-grams (or some varia...

A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and to retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Resolving Ambiguous Preposition Phrase Using Genetic Algorithm

Text mining refers to the process of discovering interesting and non-trivial patterns or knowledge embedded in unstructured text documents from a fixed domain. It is also known as knowledge discovery from text databases. Text mining tasks include text categorization, text clustering, concept/entity extraction, document summarization and entity relation modelling. Extracting concept/fact from th...

Genre Analysis and the Automated Extraction of Arguments from Student Essays

A full understanding of text is out of reach of current human language technology. However, a shallow Natural Language Processing (NLP) approach can be used to provide automated help in the assessment of essays: our approach uses genre, cue phrases and a set of patterns. Cue phrases, with their associated semantics, are used in conjunction with patterns to identify categories of argumentation p...

Publication date: 2000